Below is the statistical summary of the dataset.
names(wqw)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
str(wqw)
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
summary(wqw)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The dataset consists of 13 columns (a row index X, 11 chemical properties and quality), with a total of 4,898 observations.
#bar chart for quality
ggplot(aes(x = quality), data = wqw) +
geom_bar()
#create a new variable for further investigation
wqw$quality.factor <- factor(wqw$quality, ordered=TRUE)
Most white wines are rated 5 or 6. A new variable, quality.factor, is created as an ordered factor for further investigation.
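As a minimal sketch of why an ordered factor is convenient, the snippet below builds one the same way as quality.factor and compares it against a level. The `ratings` vector is made up for illustration and is not taken from wqw.

```r
# Hypothetical ratings for illustration only; not rows from wqw.
ratings <- c(5, 6, 6, 7, 3, 8, 6, 5)

# Same construction as quality.factor in the text.
ratings_factor <- factor(ratings, ordered = TRUE)

# Levels are sorted numerically because the input is numeric.
rating_levels <- levels(ratings_factor)

# Ordered factors support order comparisons against a level.
is_high <- ratings_factor >= "7"
n_high <- sum(is_high)
```

An unordered factor would return NA with a warning for the same comparison, which is one reason the ordered form is useful when grouping wines into lower and higher quality.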
#bar chart for fixed acidity
ggplot(aes(x = fixed.acidity), data = wqw) +
geom_bar()
#bar chart for fixed acidity, zoomed in to exclude outliers
ggplot(aes(x = fixed.acidity), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(2, 12, 2)) +
coord_cartesian(xlim = c(2, 12))
#bar chart for volatile acidity
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_bar()
#bar chart for volatile acidity, zoomed in to exclude outliers
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(0, 0.6, 0.1)) +
coord_cartesian(xlim = c(0, 0.6))
#bar chart for citric acid
ggplot(aes(x = citric.acid), data = wqw) +
geom_bar()
#bar chart for citric acid, zoomed in to exclude outliers
ggplot(aes(x = citric.acid), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(0, 0.8, 0.2)) +
coord_cartesian(xlim = c(0, 0.8))
Fixed acidity, volatile acidity and citric acid all measure acid content, so their graphs look alike. After excluding the outliers, all three graphs show a roughly normal distribution. This is consistent with the summary of the dataset, where the median is very close to the mean for all three variables.
#bar chart for residual sugar
ggplot(aes(x = residual.sugar), data = wqw) +
geom_bar()
#bar chart for residual sugar, using logscale to transform
ggplot(aes(x = residual.sugar), data = wqw) +
geom_bar() +
scale_x_log10() +
coord_cartesian(xlim = c(0.1, 30))
The original distribution of residual sugar is heavily right skewed. After transforming the x-axis to a log scale, the distribution appears bimodal.
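The effect of the log scale can be checked roughly from the five-number summary of residual.sugar printed by summary(wqw) above. The sketch below compares the length of the upper tail with the length of the lower tail, before and after a log10 transform. `tail_ratio` is a hypothetical helper name, and the measure is only a crude indicator of skew, not a formal test.

```r
# Five-number summary of residual.sugar, copied from summary(wqw) above.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Ratio of the upper tail length to the lower tail length around the median.
# A value near 1 suggests symmetry; a large value suggests right skew.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

raw_ratio <- tail_ratio(sugar)          # tails on the original scale
log_ratio <- tail_ratio(log10(sugar))   # tails after a log10 transform
```

On the original scale the upper tail is more than ten times longer than the lower tail, while on the log10 scale the two tails are of similar length.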
#bar chart for chlorides
ggplot(aes(x = chlorides), data = wqw) +
geom_bar()
#bar chart for chlorides, zoomed in to exclude outliers
ggplot(aes(x = chlorides), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(0, 0.1, 0.01)) +
coord_cartesian(xlim = c(0, 0.1))
After excluding the outliers, chlorides shows a roughly normal distribution. This is consistent with the summary of the dataset, where the median is very close to the mean.
#bar chart for free sulfur dioxide
ggplot(aes(x = free.sulfur.dioxide), data = wqw) +
geom_bar()
#bar chart for free sulfur dioxide, zoomed in to exclude outliers
ggplot(aes(x = free.sulfur.dioxide), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(0, 90, 10)) +
coord_cartesian(xlim = c(0, 90))
#bar chart for total sulfur dioxide
ggplot(aes(x = total.sulfur.dioxide), data = wqw) +
geom_bar()
#bar chart for total sulfur dioxide, zoomed in to exclude outliers
ggplot(aes(x = total.sulfur.dioxide), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(0, 300, 20)) +
coord_cartesian(xlim = c(0, 300))
The distributions of free and total sulfur dioxide are very similar: both are roughly normal after excluding the outliers. However, free sulfur dioxide spans a narrower range than total sulfur dioxide (interquartile range 23 to 46 versus 108 to 167).
#bar chart for density
ggplot(aes(x = density), data = wqw) +
geom_bar()
#bar chart for density, bins of small values are combined to have a better visualisation
cut_density <- cut(wqw$density, breaks = 100, labels=1:100)
ggplot(aes(x = cut_density), data = wqw) +
geom_bar()
The white wines cover only a very narrow range of density, approximately from 0.987 to 1.04. Because there are many distinct density values that differ only slightly, we cut the values into 100 equal-width groups to get a better visualisation.
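The binning step can be illustrated on a handful of values. The sketch below applies the same `cut()` call to the first five density values shown in the str(wqw) output, using 4 bins instead of 100 so the result is easy to read.

```r
# First five density values shown in the str(wqw) output above.
density_sample <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# With a single number for `breaks`, cut() splits the observed range
# into that many equal-width intervals (4 here, 100 in the text).
bins <- cut(density_sample, breaks = 4, labels = 1:4)

# The factor codes match the labels 1:4, so they give the bin number.
bin_index <- as.integer(bins)
```

Each value is replaced by the number of the interval it falls into, so plotting the binned variable with `geom_bar()` behaves much like a histogram.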
#bar chart for pH
ggplot(aes(x = pH), data = wqw) +
geom_bar()
pH shows a normal distribution, with values concentrated from 3 to 3.3.
#bar chart for sulphates
ggplot(aes(x = sulphates), data = wqw) +
geom_bar()
#bar chart for sulphates, zoomed in to exclude outliers
ggplot(aes(x = sulphates), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(0.2, 0.8, 0.1)) +
coord_cartesian(xlim = c(0.2, 0.8))
The distribution is roughly normal after excluding the outliers, and it is quite similar to those of free sulfur dioxide and total sulfur dioxide. The relationship between these three variables will be checked in a later section of this project.
#bar chart for alcohol
ggplot(aes(x = alcohol), data = wqw) +
geom_bar() +
scale_x_continuous(breaks = seq(8, 14.2, 1))
The distribution of alcohol is slightly right skewed, and is concentrated between 9 and 11.
The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. The 11 variables are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol. All 11 variables are numeric.
At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). Quality is recorded as an integer.
The main feature of interest in the dataset is quality. This project is to explore which chemical properties influence the quality of white wines.
According to the documentation, alcohol, volatile acidity, free sulfur dioxide and total sulfur dioxide seem to affect the quality of white wines the most. Their relationships will be explored in the following section.
A new variable, quality.factor, is created. It is an ordered factor, which will be more useful for further investigation.
There are no missing values in the dataset.
For alcohol, the distribution is slightly right skewed. It is not adjusted because the skewness is not extreme.
For residual sugar, the distribution is heavily right skewed. After a log-scale transformation, it shows a bimodal distribution.
The rest of the variables have many outliers. After excluding the outliers, they all show a distribution close to normal, so no other transformation or adjustment is needed.
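Note that the plots above use `coord_cartesian()`, which zooms the axis without dropping any rows from wqw. If the outliers should actually be excluded from later calculations, one common option is the 1.5 × IQR rule. The helper below is only a sketch: `drop_outliers` is a hypothetical name, the cut-off `k = 1.5` is a conventional choice rather than something the analysis specifies, and the sample vector is made up.

```r
# Keep only values within k * IQR of the quartiles (Tukey's rule).
drop_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  x[x >= q[1] - fence & x <= q[2] + fence]
}

# Made-up values with one extreme point, for illustration only.
sample_values <- c(6.3, 6.8, 7.0, 7.2, 7.3, 6.5, 6.9, 14.2)
trimmed <- drop_outliers(sample_values)
```

Applied to a column such as `wqw$fixed.acidity`, the function would return the values inside the fences, which could then be summarised or plotted separately.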
The correlation table below shows the correlations between all pairs of variables.
corrgram(wqw, order=TRUE,
upper.panel=panel.cor, main="Correlation Between Chemical Properties in White Wine")
ggplot(aes(x = quality, y = alcohol), data = wqw) +
geom_boxplot(aes(color=quality.factor))
Alcohol has the highest correlation with quality (0.44). As shown above, white wines of quality 7, 8 and 9 have a higher median alcohol level, at 11% or above.
ggplot(aes(x = quality, y = density), data = wqw) +
geom_boxplot(aes(color=quality.factor)) +
coord_cartesian(ylim = c(0.98, 1.01))
Density has the second highest correlation with quality, but in the opposite direction. As shown above, median density tends to decrease as the quality of white wine increases.
ggplot(aes(x = density, y = alcohol), data = wqw) +
geom_point(alpha = 1/5, position = 'jitter') +
coord_cartesian(xlim = c(0.98, 1.01))
Of all the variables, density has the strongest correlation with alcohol, and it is negative. The scatterplot above shows this clear negative relationship.
g1 <- ggplot(aes(x = density, y = residual.sugar), data = wqw) +
geom_point(alpha = 1/5, position = "jitter") +
coord_cartesian(xlim = c(0.98, 1.01), ylim = c(0,30)) +
geom_smooth()
g2 <- ggplot(aes(x = density, y = total.sulfur.dioxide), data = wqw) +
geom_point(alpha = 1/5, position = "jitter")+
coord_cartesian(xlim = c(0.98, 1.01)) +
geom_smooth()
grid.arrange(g1,g2, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Exploring further, density is positively correlated with both residual sugar and total sulfur dioxide.
g1 <- ggplot(aes(x = alcohol, y = residual.sugar), data = wqw) +
geom_point(alpha = 1/5, position = 'jitter') +
coord_cartesian(ylim = c(0, 30)) +
geom_smooth()
g2 <- ggplot(aes(x = alcohol, y = total.sulfur.dioxide), data = wqw) +
geom_point(alpha = 1/5, position = 'jitter') +
coord_cartesian(ylim = c(0, 350)) +
geom_smooth()
grid.arrange(g1,g2, ncol=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Exploring further, alcohol is negatively correlated with both residual sugar and total sulfur dioxide.
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide), data = wqw) +
geom_point(alpha = 1/5, position = 'jitter') +
coord_cartesian(xlim = c(0, 100), ylim = c(0, 300)) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
From the above scatterplot, there is a positive correlation between free sulfur dioxide and total sulfur dioxide.
In this part, the correlations between quality and the chemical properties were found:
alcohol (0.44)
density (-0.31)
chlorides (-0.21)
volatile.acidity (-0.19)
total.sulfur.dioxide (-0.17)
fixed.acidity (-0.11)
residual.sugar (-0.10)
pH (0.1)
The correlations of quality with citric acid, sulphates and free sulfur dioxide are too small to be meaningful.
From the graph of quality vs alcohol, white wines can be partly distinguished by their alcohol level. If the alcohol level is approximately 11% or above, the wine is more likely to be of quality 7, 8 or 9.
Because alcohol and density have the highest correlations with white wine quality, the correlation between alcohol and density was also explored. It shows a strong negative correlation (-0.78) between the two.
Exploring further, both residual sugar and total sulfur dioxide are positively correlated with density and negatively correlated with alcohol. This is consistent with the negative correlation between alcohol and density.
For quality: alcohol (0.44), density (-0.31)
For non-quality pairs: density vs residual sugar (0.84), alcohol vs density (-0.78)
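The ranking of correlations with quality can also be produced numerically rather than read off the correlation plot. The sketch below assumes Pearson correlation (the default of base R's `cor()`). `cor_with_target` is a hypothetical helper, and the `toy` data frame is made up to show the call, so its numbers say nothing about the wines.

```r
# Correlation of every numeric column with a target column,
# sorted by absolute strength. Non-numeric columns (such as an
# ordered factor) and the row index X are skipped.
cor_with_target <- function(df, target = "quality", drop = "X") {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  predictors <- setdiff(numeric_cols, c(target, drop))
  r <- vapply(predictors, function(v) cor(df[[v]], df[[target]]), numeric(1))
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up data for illustration only.
toy <- data.frame(
  X = 1:5,
  quality = c(3, 4, 5, 6, 7),
  a = c(1, 2, 3, 4, 5),
  b = c(5, 4, 3, 2, 1),
  c = c(1, 3, 2, 5, 4)
)
toy_cor <- cor_with_target(toy)
```

Calling `cor_with_target(wqw)` on the real data should give one named value per chemical property, ordered from the strongest to the weakest relationship with quality.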
In this section, we explore how different chemical properties can help distinguish white wines of quality below 7.
ggplot(aes(x = density, y = alcohol), data = wqw) +
geom_point(aes(color=quality.factor)) +
coord_cartesian(xlim = c(0.98, 1.01)) +
facet_grid(~quality.factor) +
geom_hline(yintercept = 10.4)
In the above graph, a line is drawn at an alcohol level of 10.4%. It shows more clearly that all white wines of quality 9 have an alcohol level of 10.4% or above, and that a large proportion of white wines of quality 8 lie above this line as well.
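The statement about proportions above the 10.4% line can be quantified per quality level. The following is a minimal sketch using base R's `tapply()`. `share_above` is a hypothetical helper and the toy vectors are made up, so the resulting shares are illustrative only.

```r
# Share of observations at or above a threshold within each group.
share_above <- function(values, groups, threshold) {
  tapply(values >= threshold, groups, mean)
}

# Made-up alcohol levels and quality ratings, for illustration only.
toy_alcohol <- c(9.0, 10.0, 10.0, 11.0, 12.0, 12.5)
toy_quality <- c(5, 5, 6, 6, 9, 9)

shares <- share_above(toy_alcohol, toy_quality, 10.4)
```

On the real data, `share_above(wqw$alcohol, wqw$quality, 10.4)` would return one proportion per quality level, which makes it possible to check how sharply the line separates the groups.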
ggplot(aes(x=alcohol,y=chlorides, colour=quality.factor), data = wqw) +
stat_smooth(method = loess)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 10.388
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.1125
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.17016
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 10.388
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.1125
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.17016
From the above graph, good quality white wines tend to have the lowest levels of chlorides and alcohol levels above 10.4%, whereas wines with higher levels of chlorides and alcohol levels below 10.4% are most likely to be of bad quality.
ggplot(aes(x=alcohol, y=chlorides, color = quality.factor), data = wqw) +
geom_point() +
facet_wrap(~quality.factor) +
geom_smooth(colour='black')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
White wines with chloride levels above 0.15g/dm^3 are very likely to be of quality 5 or 6.
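The smoother warnings printed above ("span too small. fewer data values than degrees of freedom" from loess, and "insufficient unique values to support 10 knots" from gam) are consistent with at least one quality level containing very few wines. A quick way to check is to count the observations per level. `small_groups` is a hypothetical helper, the cut-off of 10 is an arbitrary assumption, and the toy vector is made up.

```r
# Return the groups that have fewer than min_n observations.
small_groups <- function(groups, min_n = 10) {
  counts <- table(groups)
  counts[counts < min_n]
}

# Made-up quality ratings with one sparse level, for illustration only.
toy_quality <- c(rep(5, 12), rep(6, 15), rep(9, 3))
sparse <- small_groups(toy_quality)
```

If `small_groups(wqw$quality)` reports sparse levels, one option is to fit a simpler smoother for the faceted plots, for example `geom_smooth(method = "lm")`, which needs far fewer points per group.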
ggplot(aes(x=fixed.acidity,y=volatile.acidity, colour=quality.factor), data = wqw) +
geom_point() +
geom_smooth(color = 'black') +
facet_wrap(~quality.factor)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
From the above graph, white wines of good and bad quality (3, 7, 8, 9) usually have volatile acidity below 0.6g/dm^3. In other words, wines with volatile acidity above 0.6g/dm^3 are likely to be of quality 4, 5 or 6. However, there is no clear relationship between fixed acidity or volatile acidity and the quality of white wine.
ggplot(aes(x=alcohol,y=residual.sugar, colour=quality.factor), data = wqw) +
geom_point() +
coord_cartesian(ylim = c(0, 30)) +
geom_smooth(color = 'black') +
facet_wrap(~quality.factor)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
From the above, the white wines in this dataset with residual sugar above 20g/dm^3 are all of quality 5 or 6.
ggplot(aes(x=total.sulfur.dioxide,y=alcohol, colour=quality.factor), data = wqw) +
stat_smooth(method = loess)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 84.73
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 34.27
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 410.87
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5
## Warning in sqrt(sum.squares/one.delta): NaNs produced
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 84.73
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 34.27
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 410.87
## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced
From the above, white wines with a high level of total sulfur dioxide are more likely to be of lower quality.
ggplot(aes(x=density,y=residual.sugar, colour=quality.factor), data = wqw) +
geom_point() +
coord_cartesian(xlim = c(0.985, 1.005)) +
facet_wrap(~quality.factor)
From the above, the white wines in this dataset with density above 1.0025g/cm^3 are all of quality 6.
ggplot(aes(x=density,y=pH, colour=quality.factor), data = wqw) +
geom_point() +
coord_cartesian(xlim = c(0.985, 1.01)) +
facet_wrap(~quality.factor)
From the above, unfortunately there is no clear relationship between pH and quality of white wine.
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide, colour=quality.factor), data = wqw) +
geom_point(alpha = 1/2) +
coord_cartesian(xlim = c(0, 100), ylim = c(0, 300))
At the same level of free sulfur dioxide, wines with a higher level of total sulfur dioxide normally have lower quality.
With the above graphs, we tried to find out how to distinguish the quality of white wines using chemical properties other than alcohol.
For density, the range is too narrow: we could only identify wines with an extreme density above 1.0025g/cm^3 as quality 6.
For chlorides, good quality white wines usually have the lowest levels of chlorides, whereas bad quality ones usually have the highest levels.
For residual sugar, only white wines of quality 5 and 6 have residual sugar levels above 20g/dm^3.
For total sulfur dioxide, wines with a higher level tend to have lower quality.
For free sulfur dioxide and total sulfur dioxide, it was determined in "Bivariate Analysis" that they are positively correlated. In this part, we further found that among white wines with the same level of free sulfur dioxide, those with a lower level of total sulfur dioxide tend to have higher quality.
ggplot(aes(x = quality, y = alcohol), data = wqw) +
geom_boxplot(aes(color=quality.factor)) +
ggtitle("Median Alcohol Level vs Quality")
This is the first graph that shows how to distinguish the quality of white wines. Median alcohol level tends to rise with quality. In other words, high quality white wines usually have a higher alcohol percentage.
ggplot(aes(x=alcohol, y=chlorides, color = quality.factor), data = wqw) +
geom_point() +
facet_wrap(~quality.factor) +
geom_smooth(colour='black') +
ggtitle("Chlorides vs Alcohol by Quality Factor")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
Good quality white wines tend to have a lower level of chlorides (below 0.05g/dm^3) and an alcohol level above 10.4%. If the chlorides level is very high (above 0.15g/dm^3), the white wine is very likely to be of quality 5 or 6.
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide, colour=quality.factor), data = wqw) +
geom_point(alpha = 1/2) +
coord_cartesian(xlim = c(0, 100), ylim = c(0, 300)) +
ggtitle("Total Sulfur Dioxide vs Free Sulfur Dioxide by Quality Factor")
From visual inspection and the correlation table, the level of total sulfur dioxide does not directly relate to the quality of white wines. However, it is notably correlated with the level of alcohol, negatively (-0.45).
According to the documentation, total sulfur dioxide is the sum of free sulfur dioxide and bound sulfur dioxide. The above graph shows that, at the same level of free sulfur dioxide, wines with a lower level of total sulfur dioxide tend to have higher quality. Since a lower total at the same free level means a larger free share, this implies that white wines with a higher ratio of free to total sulfur dioxide usually have higher quality.
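The sulfur dioxide ratio discussed above can be computed directly from the two columns. The sketch below defines it as free divided by total, which is an assumption about the intended definition, and reports the median ratio per quality level. `free_ratio_by_quality` is a hypothetical helper and the `toy` data frame is made up: it holds free sulfur dioxide fixed and varies only the total, so it shows the arithmetic and is not evidence about the wines.

```r
# Median share of free sulfur dioxide in total sulfur dioxide,
# computed separately for each quality level.
free_ratio_by_quality <- function(df) {
  ratio <- df$free.sulfur.dioxide / df$total.sulfur.dioxide
  tapply(ratio, df$quality, median)
}

# Made-up rows with the same free level and different totals.
toy <- data.frame(
  free.sulfur.dioxide  = c(30, 30, 30, 30),
  total.sulfur.dioxide = c(150, 120, 100, 60),
  quality              = c(5, 5, 7, 7)
)
ratios <- free_ratio_by_quality(toy)
```

With free sulfur dioxide held fixed, a lower total gives a larger ratio. Running `free_ratio_by_quality(wqw)` would show whether the real medians follow the pattern suggested by the scatterplot.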
The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At the beginning of this analysis, the distributions of all 11 variables were plotted, and almost all of them show a roughly normal distribution after excluding outliers. Only residual sugar shows a heavily right skewed distribution.
Alcohol has the highest correlation with quality (0.44). We then explored the relationships between different chemical properties. Alcohol is strongly negatively correlated with density (-0.78), and density is positively correlated with residual sugar (0.84) and total sulfur dioxide (0.53). When we looked further into total sulfur dioxide, we found that white wines with a higher ratio of free to total sulfur dioxide usually have higher quality.
Among all the graphs we plotted, we could distinguish between good and bad quality white wines using alcohol, chlorides and the free sulfur dioxide ratio. In addition, at certain levels of chlorides and residual sugar, we could approximately ascertain that the quality is at a normal level (around 5 and 6).
To conclude, with the current dataset it is relatively easy to distinguish between good and bad quality when certain chemical properties are at an extreme level, but it is very difficult to distinguish normal quality from good or bad quality. This may be because many chemical properties interact in a bottle of white wine, so a small difference in some of them may not noticeably affect the taste.
If the dataset also included the production year, the country of production or even the name of the chateau, and if more people tasted each white wine, there would be more information to determine the correlations, and it might be easier to distinguish between different qualities of white wine.